WDI and ggplot2
File names: a3_123456.Rmd and a3_123456.nb.html (replace 123456 with your ID).
Submit a3_123456.Rmd and a3_123456.nb.html to Moodle.
Choose at least one indicator of WDI.
Explore the data by visualization using ggplot2.
Write your observations and the difficulties you encountered.
Due: 2023-01-16 23:59:00. Submit your R Notebook file in Moodle (The Third Assignment). Due on Monday!
Follow the workflow explained in EDA4 on January 18.
In RStudio,
1.1. Project: project_name.Rproj in your project folder (directory)
1.2. data folder (directory): data
1.3. Move (or copy) data for the project to the data folder, data
To see the contents of data: press Files in the bottom-right pane and click data, the data folder.
2.1. Project Notebook: Memo
Create an R Notebook: File > New File > R Notebook
Add a descriptive title.
2.2. Setup Code Chunk
Create a code chunk and add packages to use in the project and RUN the code.
library(tidyverse)
library(WDI)
Readers of a paper want to know the minimal set of packages they need to install, so load with library() only the packages you actually use in the paper.
2.3. Choose Source or Visual editor mode,
and start editing Project Notebook
2.4. For a report, use Save As to create a new file and edit that file
First, get to know the variables. At a minimum, you must know whether each variable is categorical or numerical.
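A minimal sketch of checking variable types follows. The data frame df_example and its values are made up for illustration and only imitate the shape of a WDI download; the classification rule (numeric class means numerical, anything else categorical) is a rough assumption, since a numeric column such as year is sometimes better treated as categorical.

```r
library(dplyr)

# Hypothetical mini data set shaped like a WDI download with extra = TRUE.
df_example <- tibble(
  country = c("Japan", "Vietnam", "China"),
  year    = c(2020L, 2020L, 2020L),
  income  = c("High income", "Lower middle income", "Upper middle income"),
  value   = c(1.5, 3.2, 2.4)  # made-up numbers, for illustration only
)

# One row per variable: its name, its class, and a rough classification.
is_num <- unname(vapply(df_example, is.numeric, logical(1)))
var_types <- tibble(
  variable = names(df_example),
  class    = unname(vapply(df_example, function(x) class(x)[1], character(1))),
  kind     = ifelse(is_num, "numerical", "categorical")
)
var_types
```

On real data, glimpse(df) or summary(df) gives a similar overview in one call.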
It is not required, but it is a good idea to run the following code chunk and pass wdi_cache to WDIsearch() so that it uses updated metadata. See the WDIcache help.
wdi_cache <- WDIcache()
WDIsearch returns information about the data, i.e., the metadata of WDI: Name, Indicator, Description, Source, etc.
WDIsearch(string = "gdp", field = "name", short = TRUE, cache = NULL)
Arguments
string: Character string. Search for this string
using grep with ignore.case=TRUE.
field: Character string. Search this field.
Admissible fields: ‘indicator’, ‘name’, ‘description’, ‘sourceDatabase’,
‘sourceOrganization’
short: TRUE: Returns only the indicator’s code and
name. FALSE: Returns the indicator’s code, name, description, and
source.
cache: Data list generated by the WDIcache function.
If omitted, WDIsearch will search a local list of series.
If you follow the order of the arguments, you can omit argument
names: string, field, short,
cache.
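The search described above can be imitated offline. The sketch below uses series_example, a hypothetical two-column stand-in for the series table; the indicator names were typed by hand and may differ from the current WDI metadata. It only illustrates the case-insensitive matching on a chosen field, not the actual implementation of WDIsearch.

```r
library(dplyr)

# Hypothetical stand-in for the series metadata (the real table is much larger).
series_example <- tibble(
  indicator = c("NY.GDP.PCAP.KD", "NY.GDP.MKTP.KD", "SE.ADT.LITR.ZS"),
  name = c("GDP per capita (constant 2015 US$)",
           "GDP (constant 2015 US$)",
           "Literacy rate, adult total (% of people ages 15 and above)")
)

# Case-insensitive match on the `name` field, similar in spirit to
# WDIsearch(string = "gdp", field = "name").
hits_name <- series_example %>%
  filter(grepl("gdp", name, ignore.case = TRUE))

# Match on the `indicator` field instead, similar in spirit to
# WDIsearch(string = "LITR", field = "indicator").
hits_code <- series_example %>%
  filter(grepl("LITR", indicator, ignore.case = TRUE))
```

Because the match is a regular expression, characters such as "." or "(" in the search string have a special meaning unless they are escaped.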
WDIsearch("NY.GNS.ICTR.ZS", "indicator")
WDIsearch("Literacy rate", "name")
WDIsearch("IP.JRN.ARTC.SC", "indicator", FALSE)
Use the wdi_cache downloaded above to search the updated metadata.
WDIsearch("Government expenditure on education", "name", FALSE, wdi_cache)
WDI(country = "all", indicator = "NY.GDP.PCAP.KD", start = 1960, end = NULL, extra = FALSE, cache = NULL, latest = NULL, language = "en")
WDI(indicator = "NY.GDP.PCAP.KD")
You can use the name df if you download and use only one data set, but it is better to assign the data to a more descriptive name.
df_gdppcap <- WDI(indicator = "NY.GDP.PCAP.KD")
df_gdppcap
To include extra information such as region, income, and lending, use extra = TRUE.
df_gdppcap_extra <- WDI(indicator = "NY.GDP.PCAP.KD", extra = TRUE)
df_gdppcap_extra
Since the data is generally large, it is better to use a copy saved on your computer. Before saving it, check that you have a data folder (or directory) in your project folder. Use the Files tab of the bottom-right pane. You should see the project icon with Rproj at the end, the R Notebook file you are editing, and the data folder. write_csv() writes to exactly the path you give and does not add an extension, so include .csv in the file name.
write_csv(df_gdppcap_extra, "data/gdppcap_extra.csv")
To read the data, run the next code chunk. You do not have to run the code
df_gdppcap_extra <- WDI(indicator = "NY.GDP.PCAP.KD", extra = TRUE)
again.
gdppcap <- read_csv("data/gdppcap_extra.csv")
"data/gdppcap_extra.csv" is the path of the file gdppcap_extra.csv, in csv format, in the data folder.
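The save-then-read workflow can be wrapped in a small helper so that the download runs only when no local copy exists. This is a sketch under assumptions: read_or_download and fake_download are hypothetical names introduced here, and a made-up data frame in a temporary folder replaces the real WDI() call so the example runs offline.

```r
library(readr)

# Hypothetical helper: download only when no local copy exists.
# `download_fun` stands in for a call such as
# function() WDI(indicator = "NY.GDP.PCAP.KD", extra = TRUE).
read_or_download <- function(path, download_fun) {
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    write_csv(download_fun(), path)
  }
  read_csv(path, show_col_types = FALSE)
}

# Offline demonstration with a made-up data frame and a temporary folder.
fake_download <- function() {
  data.frame(country = c("A", "B"), year = c(2020, 2020), value = c(1.5, 2.5))
}
path <- file.path(tempdir(), "data", "example.csv")
df_first  <- read_or_download(path, fake_download)  # writes the file, then reads it
df_second <- read_or_download(path, fake_download)  # reads the saved file only
```

To refresh the data later, delete the csv file and run the chunk again.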
country argument
If you want to import data for several countries, you can use the iso2c codes of the countries.
ASEAN <- c("BN", "ID", "KH", "LA", "MM", "MY", "PH", "SG")
df_gdppcap_asean <- WDI(ASEAN, "NY.GDP.PCAP.KD")
df_gdppcap_asean %>% distinct(country) %>% pull()
[1] "Brunei Darussalam" "Indonesia" "Cambodia"
[4] "Lao PDR" "Myanmar" "Malaysia"
[7] "Philippines" "Singapore"
wdi_cache$country %>% filter(iso2c %in% ASEAN) %>% distinct(country) %>% pull()
[1] "Brunei Darussalam" "Indonesia" "Cambodia"
[4] "Lao PDR" "Myanmar" "Malaysia"
[7] "Philippines" "Singapore"
You can also use wdi_cache$country.
wdi_cache$country
wdi_cache$country %>% filter(iso2c %in% ASEAN) %>% pull(country)
[1] "Brunei Darussalam" "Indonesia" "Cambodia"
[4] "Lao PDR" "Myanmar" "Malaysia"
[7] "Philippines" "Singapore"
You can find iso2c codes from the downloaded data, or by using wdi_cache$country.
wdi_cache$country %>%
  filter(country %in% c("Brunei Darussalam", "Indonesia", "Cambodia", "Lao PDR", "Myanmar", "Malaysia", "Philippines", "Singapore")) %>%
  pull(iso2c)
[1] "BN" "ID" "KH" "LA" "MM" "MY" "PH" "SG"
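One detail worth knowing about this lookup: filter() with %in% returns codes in the order of the table, not in the order of the names you asked for. The sketch below shows the difference on country_example, a hypothetical four-row stand-in for the country table; match() is one possible way to keep the requested order.

```r
library(dplyr)

# Hypothetical stand-in for the country metadata, in the table's own order.
country_example <- tibble(
  iso2c   = c("BN", "ID", "KH", "SG"),
  country = c("Brunei Darussalam", "Indonesia", "Cambodia", "Singapore")
)

wanted <- c("Singapore", "Cambodia", "Brunei Darussalam")

# filter() keeps the table's row order, not the order of `wanted`.
codes_table_order <- country_example %>%
  filter(country %in% wanted) %>%
  pull(iso2c)

# match() returns one code per requested name, in the requested order
# (and NA for a name that is not in the table).
codes_request_order <- country_example$iso2c[match(wanted, country_example$country)]
```

Here codes_table_order is "BN" "KH" "SG", while codes_request_order is "SG" "KH" "BN".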
You can also find the countries in a given category.
wdi_cache$country %>%
  filter(region == "South Asia") %>%
  pull(country)
[1] "Afghanistan" "Bangladesh" "Bhutan" "India" "Sri Lanka"
[6] "Maldives" "Nepal" "Pakistan"
wdi_cache$country %>%
  filter(income == "Lower middle income") %>%
  pull(iso2c)
 [1] "AO" "BJ" "BD" "BO" "BT" "CI" "CM" "CG" "KM" "CV" "DJ" "DZ" "EG" "FM" "GH"
[16] "HN" "HT" "ID" "IN" "IR" "KE" "KG" "KH" "KI" "LA" "LB" "LK" "LS" "MA" "MM"
[31] "MN" "MR" "NG" "NI" "NP" "PK" "PH" "PG" "PS" "SN" "SB" "SV" "ST" "SZ" "TJ"
[46] "TL" "TN" "TZ" "UA" "UZ" "VN" "VU" "WS" "ZW"
The following shows a list of indicators.
wdi_cache$series
The following is the same as WDIsearch("gdp per cap", "name"). See the WDIsearch help.
wdi_cache$series %>% filter(grepl("gdp per cap", name, ignore.case = TRUE))
wdi_cache$series %>% filter(indicator == "SG.GEN.PARL.ZS") %>% pull(name)
[1] "Proportion of seats held by women in national parliaments (%)"
df_women_in_parl <- WDI(indicator = c(women_in_parl = "SG.GEN.PARL.ZS")) %>%
  drop_na(women_in_parl)
df_women_in_parl %>% filter(year == 2020) %>%
  ggplot(aes(women_in_parl)) + geom_histogram()
df_women_in_parl %>% filter(year == 2020) %>%
  ggplot(aes(women_in_parl)) + geom_histogram(binwidth = 5)
wdi_cache$series %>% filter(indicator == "IP.JRN.ARTC.SC") %>% pull(name)
[1] "Scientific and technical journal articles"
df_stja <- WDI(indicator = c(stja = "IP.JRN.ARTC.SC"), extra = TRUE) %>%
  drop_na(stja)
The default number of bins is 30. Try other values to find an appropriate one.
df_stja$year %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2000    2004    2009    2009    2014    2018
df_stja %>% filter(income != "Aggregates") %>%
  ggplot(aes(stja)) + geom_histogram(bins = 20)
df_stja %>% filter(income != "Aggregates", stja > 0) %>%
  ggplot(aes(stja)) + geom_histogram(bins = 20) + scale_x_log10()
df_stja %>% filter(income != "Aggregates", income != "Not classified", stja > 0) %>%
  ggplot(aes(stja, fill = income)) + geom_histogram(alpha = 0.7, bins = 20, color = "black") + scale_x_log10()
df_stja %>% filter(income != "Aggregates", income != "Not classified", stja > 0) %>%
  ggplot(aes(stja, fill = income)) + geom_density(alpha = 0.3) + scale_x_log10()
wdi_cache$series %>% filter(indicator == "MS.MIL.XPND.GD.ZS") %>% pull(name)
[1] "Military expenditure (% of GDP)"
df_milxpnd <- WDI(indicator = c(milxpnd = "MS.MIL.XPND.GD.ZS"), extra = TRUE) %>%
  drop_na(milxpnd)
The default number of bins is 30. Here I chose binwidth = 0.5 (%).
df_milxpnd %>% filter(income != "Aggregates", year == 2021) %>%
  ggplot(aes(milxpnd, fill = income)) + geom_histogram(color = "black", binwidth = 0.5) +
  labs(title = "Military expenditure (% of GDP)")
geom_density() has no binwidth parameter; passing one only produces the warning "Ignoring unknown parameters: `binwidth`", so it is omitted here.
df_milxpnd %>% filter(income != "Aggregates", income != "Not classified", year == 2021) %>%
  ggplot(aes(milxpnd, fill = income)) + geom_density(alpha = 0.3) +
  labs(title = "Military expenditure (% of GDP)")
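The two ways of controlling a histogram, bins and binwidth, can be compared on simulated data. The sketch below is only an illustration: df_sim holds made-up values, not WDI data, and boundary = 0 is an assumed choice so that the bins start at a round number. layer_data() from ggplot2 returns the bins that a plot actually computed.

```r
library(ggplot2)

set.seed(1)
# Simulated percentages, for illustration only (not WDI data).
df_sim <- data.frame(value = runif(200, min = 0, max = 10))

# Same data, two ways to control the bins.
p_bins     <- ggplot(df_sim, aes(value)) + geom_histogram(bins = 20)
p_binwidth <- ggplot(df_sim, aes(value)) +
  geom_histogram(binwidth = 0.5, boundary = 0)

# layer_data() exposes the computed bins, so the choice can be inspected.
bins_a <- layer_data(p_bins)
bins_b <- layer_data(p_binwidth)
```

binwidth is given in the units of the variable (here 0.5 percentage points), which often makes it easier to explain in a report than a number of bins.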
wdi_cache$series %>% filter(indicator == "SL.UEM.TOTL.ZS") %>% pull(name)
[1] "Unemployment, total (% of total labor force) (modeled ILO estimate)"
df_ur <- WDI(indicator = c(ur = "SL.UEM.TOTL.ZS"),
  extra = TRUE, cache = wdi_cache) %>% drop_na(ur)
df_ur %>% filter(income == "Aggregates") %>% filter(grepl('income', country)) %>%
  filter(year >= 2018) %>% ggplot() + geom_line(aes(x = year, y = ur, color = country))
wdi_cache$series %>% filter(indicator == "FP.CPI.TOTL.ZG") %>% pull(name)
[1] "Inflation, consumer prices (annual %)"
df_infl <- WDI(country = c("VN","CN","JP"),
  indicator = c(inf = "FP.CPI.TOTL.ZG"), extra = TRUE) %>% drop_na(inf)
df_infl %>% filter(year %in% c(2000, 2007, 2020),
    country %in% c("Japan", "Vietnam", "China")) %>%
  ggplot(aes(x = year, y = inf, color = country)) +
  geom_line() + geom_point() +
  # inf is already in percent, so scale = 1 keeps label_percent() from multiplying by 100
  geom_text(aes(label = scales::label_percent(accuracy = 1, scale = 1)(inf)), nudge_y = 0.8) +
  labs(title = "Inflation of Japan, Vietnam and China from 2000 to 2020",
    x = "", y = "Inflation, consumer prices (annual %)")
wdi_cache$series %>% filter(indicator %in% c("SI.POV.NAHC","SI.POV.MDIM")) %>% pull(name)
[1] "Multidimensional poverty headcount ratio (% of total population)"
[2] "Poverty headcount ratio at national poverty lines (% of population)"
df_wdi_poverty <- WDI(
  indicator = c(poverty = "SI.POV.NAHC", multipoverty = "SI.POV.MDIM", gdppercap = "NY.GDP.PCAP.KD"), start = 1990,
  extra = TRUE) %>% drop_na(poverty, multipoverty, gdppercap)
df_wdi_poverty %>%
  filter(income != "Aggregates") %>%
  # average over the years within each country, as the subtitle states
  group_by(country, income) %>%
  summarize(mean_gdp = mean(gdppercap), mean_poverty = mean(poverty), .groups = "drop") %>%
  ggplot(aes(x = mean_gdp)) + geom_point(aes(y = mean_poverty, color = income)) +
  scale_x_log10() +
  geom_smooth(aes(y = mean_poverty), formula = y~x, linetype = "longdash", color = "black", method = "lm", se = FALSE) +
  labs(x = "GDP per capita", y = "poverty rate (% of population)", title = "Poverty rates and GDP per capita", subtitle = "world countries, 1990-2021 average, by income level")
df_wdi_poverty %>%
  filter(region != "Aggregates") %>%
  # average over the years within each country, as the subtitle states
  group_by(country, region) %>%
  summarize(mean_gdp = mean(gdppercap), mean_multipoverty = mean(multipoverty), .groups = "drop") %>%
  ggplot(aes(x = mean_gdp)) + geom_point(aes(y = mean_multipoverty, color = region)) +
  scale_x_log10() +
  geom_smooth(aes(y = mean_multipoverty), formula = y~x, linetype = "longdash", color = "black", method = "lm", se = FALSE) +
  labs(x = "GDP per capita", y = "Multidimensional poverty rate (% of population)", title = "Multidimensional poverty rates and GDP per capita", subtitle = "world countries, 1990-2021 average, by region")
scale_y_continuous(sec.axis = sec_axis(~ . scaling_function))
Suppose you have two indicators,
WDIsearch(string = "NY.GDP.MKTP.KD", field = "indicator", short = FALSE, cache = wdi_cache)
WDIsearch(string = "NY.GDP.PCAP.KD", field = "indicator", short = FALSE, cache = wdi_cache)
List the names of the ASEAN and BRICs countries using wdi_cache$country.
asean <- c("Brunei Darussalam", "Cambodia", "Lao PDR", "Myanmar",
  "Philippines", "Indonesia", "Malaysia", "Singapore")
brics <- c("Brazil", "Russian Federation", "India", "China")
Find the iso2c codes of the countries using wdi_cache$country.
wdi_cache$country %>%
  filter(country %in%
    c("Brunei Darussalam", "Cambodia", "Lao PDR", "Myanmar",
      "Philippines", "Indonesia", "Malaysia", "Singapore",
      "Brazil", "Russian Federation", "India", "China")) %>%
  pull(iso2c)
 [1] "BR" "BN" "CN" "ID" "IN" "KH" "LA" "MM" "MY" "PH" "RU" "SG"
Separate the iso2c codes of the countries with commas and read the data using WDI.
wdi_gdp <- WDI(
  country = c("BR", "BN", "CN", "ID", "IN", "KH", "LA", "MM", "MY", "PH", "RU", "SG"),
  indicator = c(gdp = "NY.GDP.MKTP.KD", gdpPercap = "NY.GDP.PCAP.KD"),
  start = 1960, extra = TRUE, cache = wdi_cache)
wdi_gdp %>% filter(country %in% asean) %>% drop_na(gdp, gdpPercap) %>% summary()
country iso2c iso3c year
Length:424 Length:424 Length:424 Min. :1960
Class :character Class :character Class :character 1st Qu.:1980
Mode :character Mode :character Mode :character Median :1995
Mean :1994
3rd Qu.:2008
Max. :2021
status lastupdated gdp gdpPercap
Length:424 Length:424 Min. :2.109e+09 Min. : 144.0
Class :character Class :character 1st Qu.:1.069e+10 1st Qu.: 978.2
Mode :character Mode :character Median :4.099e+10 Median : 1891.2
Mean :1.149e+11 Mean : 9718.5
3rd Qu.:1.430e+11 3rd Qu.: 8777.6
Max. :1.066e+12 Max. :66176.4
region capital longitude latitude
Length:424 Length:424 Length:424 Length:424
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
income lending
Length:424 Length:424
Class :character Class :character
Mode :character Mode :character
Using the summary, decide the scaling of two variables,
gdpPercap and gdp.
wdi_gdp %>% drop_na(gdp, gdpPercap) %>%
  filter(country %in% asean) %>%
  ggplot() +
  geom_line(aes(x = year, y = gdpPercap, linetype = country)) +
  geom_line(aes(x = year, y = gdp/(10^7), col = country)) +
  coord_trans(x = "identity", y = "log10") +
  # the secondary axis multiplies by 10^7 again, so it reads in units of gdp
  scale_y_continuous(sec.axis = sec_axis(~ . *(10^7), name = "gdp"))
wdi_gdp %>% filter(country %in% brics) %>% drop_na(gdp, gdpPercap) %>% summary()
country iso2c iso3c year
Length:219 Length:219 Length:219 Min. :1960
Class :character Class :character Class :character 1st Qu.:1978
Mode :character Mode :character Mode :character Median :1994
Mean :1993
3rd Qu.:2008
Max. :2021
status lastupdated gdp gdpPercap
Length:219 Length:219 Min. :1.091e+11 Min. : 163.9
Class :character Class :character 1st Qu.:3.564e+11 1st Qu.: 531.7
Mode :character Mode :character Median :9.215e+11 Median : 2758.9
Mean :1.644e+12 Mean : 3825.8
3rd Qu.:1.500e+12 3rd Qu.: 6579.5
Max. :1.580e+13 Max. :11188.3
region capital longitude latitude
Length:219 Length:219 Length:219 Length:219
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
income lending
Length:219 Length:219
Class :character Class :character
Mode :character Mode :character
Using the summary, decide the scaling of two variables,
gdpPercap and gdp.
wdi_gdp %>% drop_na(gdp, gdpPercap) %>%
  filter(country %in% brics) %>%
  ggplot() +
  geom_line(aes(x = year, y = gdpPercap, linetype = country)) +
  geom_line(aes(x = year, y = gdp/(10^9), col = country)) +
  coord_trans(x = "identity", y = "log10") +
  # the secondary axis must undo the division by 10^9, so it reads in units of gdp
  scale_y_continuous(sec.axis = sec_axis(~ . *(10^9), name = "gdp"))
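Choosing the scaling can also be done with a small calculation instead of by eye. The sketch below is one possible rule, not the only one: it takes the power of 10 closest to the ratio of the two medians. The data frame df_sim and the name k are made up for illustration; the same k is used to divide the large variable and to multiply the secondary axis, which keeps the two consistent.

```r
library(ggplot2)

# Simulated series, for illustration only: one small-valued, one large-valued.
df_sim <- data.frame(
  year      = 2000:2009,
  gdpPercap = seq(1000, 1900, by = 100),
  gdp       = seq(1e11, 1.9e11, by = 1e10)
)

# Pick a power of 10 that brings gdp onto roughly the same range as gdpPercap.
k <- 10^round(log10(median(df_sim$gdp) / median(df_sim$gdpPercap)))

p <- ggplot(df_sim, aes(x = year)) +
  geom_line(aes(y = gdpPercap)) +
  geom_line(aes(y = gdp / k), linetype = "dashed") +
  # The secondary axis multiplies by the same k, so it reads in units of gdp.
  scale_y_continuous(name = "gdpPercap",
                     sec.axis = sec_axis(~ . * k, name = "gdp"))
```

With these simulated values the medians are 1450 and 1.45e11, so k is 10^8.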